
Spark v2 design & migration

Overview of Spark v1 (the current version)

Database of eligible deals

  • Updated manually once per week
  • Verified StorageMarket deals only
  • An eligible deal is (minerId, payloadCid, clientId), built as follows:
{
  minerId: DealProposal.Provider,  // f0...
  payloadCid: DealProposal.Label,  // bafy...
  clientId: DealProposal.Client    // f0...
}

Retrieval check (v1)

  1. Query blockchain state (Filecoin.StateMinerInfo(minerId, null)) to obtain miner’s peer ID
  2. Query IPNI at https://cid.contact to find the advertised retrieval provider for payloadCid that matches the miner’s peer ID
  3. Fetch payloadCid using the advertised provider address and retrieval protocol (HTTP or Graphsync)
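The three steps can be sketched as a single function. The `clients` object and its method names are hypothetical stand-ins for the JSON-RPC and HTTP calls, injected so the flow can run without the network; this is an illustration of the flow, not Spark’s actual implementation:

```javascript
// Sketch of the v1 retrieval check. All client method names are illustrative.
async function checkRetrievalV1 ({ minerId, payloadCid }, clients) {
  // 1. Miner's peer ID from chain state (Filecoin.StateMinerInfo)
  const { PeerId: peerId } = await clients.stateMinerInfo(minerId)

  // 2. Ask cid.contact which provider advertises payloadCid for that peer ID
  const provider = await clients.findProvider(payloadCid, peerId)
  if (!provider) return { retrievable: false, reason: 'NOT_ADVERTISED' }

  // 3. Fetch the payload over the advertised protocol (HTTP or Graphsync)
  const ok = await clients.fetchPayload(provider, payloadCid)
  return { retrievable: ok }
}
```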

Spark v2 design (the upcoming version)

Database of eligible deals

  • Updated in near-real-time by observing actor events
  • All verified deals (StorageMarket, DDO, potentially other deal types added in the future)
  • An eligible deal is (minerId, pieceCid, pieceSize, clientId) built as follows using the claim metadata (docs):
    {
    	minerId: metadata.provider,        // 1660795 means f01660795
    	pieceCid: metadata["piece-cid"],   // baga...
    	pieceSize: metadata["piece-size"], // 34359738368
    	clientId: metadata.client          // 2147046 means f02147046
    }
  • Additionally, we need to keep track of the lifetime of the sector and claim to understand the start and end of the time window in which the deal is “active” and expected to be retrievable. All information should be available in the actor events, but we need to research how exactly to determine the start and end.
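As a sketch, the mapping above could look like this in JavaScript. The helper names are illustrative; the only non-obvious step is that actor events carry numeric actor IDs, so we prepend “f0” to get the address form used elsewhere (per the examples above, 1660795 means f01660795):

```javascript
// Convert a numeric actor ID from an actor event into an f0 address string.
function actorIdToAddress (actorId) {
  return `f0${actorId}`
}

// Build an eligible-deal row from claim metadata (illustrative shape).
function eligibleDealFromClaim (metadata) {
  return {
    minerId: actorIdToAddress(metadata.provider),   // 1660795 -> 'f01660795'
    pieceCid: metadata['piece-cid'],                // 'baga...'
    pieceSize: metadata['piece-size'],              // e.g. 34359738368
    clientId: actorIdToAddress(metadata.client)     // 2147046 -> 'f02147046'
  }
}
```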

Retrieval check (v2)

  1. Query blockchain state (Filecoin.StateMinerInfo(minerId, null)) to obtain miner’s peer ID.
  2. Build contextID from (pieceCid, pieceSize) using the same algorithm as Curio (source code).
  3. Ask Spark Piece Indexer (later IPNI Reverse Index) for a sample of one payload block (payloadCid) indexed for this context & miner’s peer ID.
  4. Query IPNI at https://cid.contact to find the advertised retrieval provider for payloadCid that matches the miner’s peer ID.
  5. Fetch payloadCid using the advertised provider address and retrieval protocol (HTTP only).
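The five steps can be sketched as one function. As before, the injected `clients` methods are hypothetical stand-ins for the real RPC/HTTP calls; the distinction between “not indexed” (step 3 fails) and “not retrievable” (steps 4-5 fail) is the new piece compared to v1:

```javascript
// Sketch of the v2 retrieval check. All client method names are illustrative.
async function checkRetrievalV2 ({ minerId, pieceCid, pieceSize }, clients) {
  // 1. Miner's peer ID from chain state (Filecoin.StateMinerInfo)
  const { PeerId: peerId } = await clients.stateMinerInfo(minerId)

  // 2. ContextID from (pieceCid, pieceSize), same algorithm as Curio
  const contextId = clients.buildContextId(pieceCid, pieceSize)

  // 3. Sample one payload block for this context & peer from the piece indexer
  const payloadCid = await clients.samplePayloadBlock(peerId, contextId)
  if (!payloadCid) return { indexed: false, retrievable: false }

  // 4. Find the advertised HTTP retrieval provider on cid.contact
  const provider = await clients.findHttpProvider(payloadCid, peerId)
  if (!provider) return { indexed: true, retrievable: false }

  // 5. Fetch the payload block over Trustless HTTP Gateway
  const ok = await clients.fetchBlock(provider, payloadCid)
  return { indexed: true, retrievable: ok }
}
```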

Requirements comparison

Spark v1 has the following requirements for SP software & configuration:

  • FIL+ deals must be made using the StorageMarket actor
  • DealProposal must have the Label field set to the payload root CID (starting with bafy|bafk|Qm)
  • SP must advertise deal payload blocks to IPNI. The advertisements can be served over any protocol supported by IPNI (Graphsync, HTTP).
  • The advertised retrieval must support Trustless HTTP Gateway or Graphsync protocol (or both).

Spark v2 has different requirements:

  • FIL+ deals can be made in any way that triggers claim and sector-activated events (StorageMarket, DDO, etc.).
  • SP must advertise the deal’s payload blocks to IPNI.
    • The advertisement's ContextID must be constructed from (pieceCid, pieceSize) using the same method as Curio (serialize PieceInfo to CBOR — I’d like to get this included in the IPNI spec eventually).
    • Until the IPNI Reverse Index is shipped, SP must serve the IPNI advertisements over HTTP. (This is no longer relevant, IPNI has already dropped support for Graphsync-based ingestion.)
  • The advertised retrieval must support Trustless HTTP Gateway protocol.
    Note: This is an artificial limitation; we could easily support Graphsync retrievals. However, Spark v2 gives us a great opportunity to drop support for Graphsync. This will push SPs to adopt Trustless HTTP Gateway retrievals, which is what the Filecoin builders want.
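To make the ContextID requirement concrete, here is a sketch of the encoding as I understand cbor-gen produces it for abi.PieceInfo: a CBOR array of [Size, PieceCID], with the CID wrapped in tag 42 and prefixed by the identity multibase byte. The field order and byte-level details are assumptions and should be verified against the linked Curio source before relying on the exact bytes:

```javascript
// Illustrative ContextID construction from (pieceCid, pieceSize).
// Assumption: PieceInfo is CBOR-encoded as the 2-tuple [Size, PieceCID];
// verify field order and details against Curio's source.

const BASE32 = 'abcdefghijklmnopqrstuvwxyz234567'

// Decode a base32lower multibase CID string ('b' prefix) to raw bytes.
function decodeCidV1 (cid) {
  if (!cid.startsWith('b')) throw new Error('expected a base32lower CID')
  let bits = 0
  let value = 0
  const out = []
  for (const ch of cid.slice(1)) {
    value = (value << 5) | BASE32.indexOf(ch)
    bits += 5
    if (bits >= 8) {
      out.push((value >> (bits - 8)) & 0xff)
      bits -= 8
    }
  }
  return out
}

// Minimal CBOR header: major type 0 (unsigned int) or 2 (byte string).
function cborHead (major, n) {
  const m = major << 5
  if (n < 24) return [m | n]
  if (n < 0x100) return [m | 24, n]
  if (n < 0x10000) return [m | 25, n >>> 8, n & 0xff]
  if (n < 0x100000000) {
    return [m | 26, (n >>> 24) & 0xff, (n >>> 16) & 0xff, (n >>> 8) & 0xff, n & 0xff]
  }
  const bytes = []
  let big = BigInt(n)
  for (let i = 0; i < 8; i++) {
    bytes.unshift(Number(big & 0xffn))
    big >>= 8n
  }
  return [m | 27, ...bytes]
}

function contextIdFromPieceInfo (pieceCid, pieceSize) {
  const cidBytes = decodeCidV1(pieceCid)
  return Uint8Array.from([
    0x82,                                // array(2): [Size, PieceCID]
    ...cborHead(0, pieceSize),           // Size as an unsigned int
    0xd8, 0x2a,                          // tag(42) marks a CID in DAG-CBOR
    ...cborHead(2, cidBytes.length + 1), // byte string: identity prefix + CID
    0x00,                                // identity multibase prefix
    ...cidBytes
  ])
}
```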

Miner SW compatibility table (OUT OF DATE)

  • boost, as of Q4 2024 (StorageMarket + DDO deals)
    • Spark v1: 🟠 To enable Spark to check DDO deals, the SP must advertise Graphsync retrievals (preferably alongside HTTP).
    • Spark v2: 🔴 Builds ContextID in an unsupported way.
  • curio-boost, Jan 2025 (StorageMarket + DDO deals)
    • Spark v1: 🟢
    • Spark v2: 🟢
  • venus droplet (StorageMarket + DDO deals (??))
    • Spark v1: 🟠 To enable Spark to check DDO deals, the SP must advertise Graphsync retrievals (preferably alongside HTTP).
    • Spark v2: 🔴 Builds ContextID in an unsupported way.

The Venus team needs ~2 weeks to implement & ship the required changes.

See the Implementation section in https://github.com/filecoin-project/FIPs/pull/1089 for the up-to-date status.

Migration from v1

  • As the compatibility table above shows, the switch from Spark v1 to v2 can leave many miners with no Spark RSR score until they meet the new requirements. SPs running Venus Droplet must wait until Venus implements our requirements.
  • On our side, switching from Spark v1 to v2 involves changes in several components; it would be tricky to perform this change atomically.
  • There are many unknown unknowns in the v2 design with respect to how well actor events, Curio, and the Piece Indexer will work together. It’s very likely we will need several weeks to iron out rough edges and make Spark v2 scores reliable & trustworthy.

For the reasons above, I propose running Spark v1 and v2 measurements side by side until we have enough confidence in v2. We should make it clear that we expect this period to be short (1-3 months max) and that we will shut down Spark v1 afterwards.

I’d like us to explore two options on how to pull this off:

  1. Create Spark v2 as a new checker project, similarly to how we created Voyager.
    • This means duplicating some parts of our infrastructure, though.
    • There will be two Station modules: Spark v1 and Spark v2.
    • There will be two IE smart contracts. We need to split the monthly rewards budget between them.
  2. Keep one Spark module, but let it run both v1 and v2 algorithms concurrently.

Overview of components

  • Meridian smart contract: No changes; can be cloned or shared.
  • spark-api round tracker: No changes; can be cloned or shared.
  • spark-api deal sampling: v1 and v2 use different DBs of deals. The task format can be extended to support both v1 and v2 definitions.
  • spark-checker round tracker: No changes; can be cloned or shared.
  • spark-checker retrieval impl.: v1 and v2 use different steps.
  • spark-checker measurement: v2 will need to add additional fields to indicate whether we found a sample of deal payload blocks in the IPNI advertisements.
  • spark-publish: We must add v2 fields to published measurements. It’s easy to modify spark-publish to handle both v1 & v2 measurements.
  • spark-evaluate measurement listener & evaluation loop: No changes; can be cloned or shared.
  • spark-evaluate fraud detection / evaluation: In v2, we must check consensus about availability of the payload sample.
  • spark-evaluate platform metrics writer: No changes; can be cloned or shared.
  • spark-evaluate retrieval metrics writer: We want to distinguish retrieval-related metrics produced by v1 and v2. Additionally, for v2, we want to track how many deals/retrievals were able to find a sample of payload blocks. This can be handled uniformly by setting the value to 100% for v1 measurements.
  • spark-stats platform metrics: No changes; can be cloned or shared.
  • spark-stats retrieval metrics: We want to distinguish retrieval-related metrics produced by v1 and v2.
  • Dashboards - platform metrics: No changes; can be cloned or shared.
  • Dashboards - retrieval metrics: We want to distinguish retrieval-related metrics produced by v1 and v2.

Help SPs diagnose their compliance with Spark v2

Idea: we can enhance Spark v1 with new checks performed as part of each retrieval task.

  1. We already detect when an SP advertises only Graphsync retrievals. Let’s surface this information in our dashboards and even in the FIL+ compliance tooling.
  2. When we find the retrieval advertisement in the IPNI response, check whether the ContextID can be parsed as PieceInfo in the Spark v2 format. Flag when it cannot be parsed. Again, let’s make this information visible in our dashboards.

    This is a necessary but not sufficient condition. To get more confidence, we would need to modify fil-deal-ingester to store PieceInfo in the Eligible Deals DB and propagate it to Round Retrieval Task List. Then the checker nodes would need to construct ContextID from task’s PieceInfo and check whether it matches the ContextID returned by IPNI.

    Another possible hardening is to let Spark v1 checker nodes query https://pix.filspark.com/sample/{peer-id}/{context-id} to verify that Spark Piece Indexer was able to ingest the deal being checked.

  3. Finally, we need to check whether the SP serves IPNI advertisements over HTTP. This can be checked on demand by querying https://pix.filspark.com/ingestion-status/{peer-id}

Spark v1.5 - DDO hack

How can we make Spark test DDO deals with as little work as possible?

Idea for a hacky solution:

  • Keep the Retrieval Task defined as (minerId, payloadCid)
    • This means most of the Spark stack can stay as-is: checkers, spark-api, spark-evaluate, spark-stats, dashboards, etc.
  • Use IPNI Graphsync metadata to map DDO PieceCID to PayloadCID
    • This requires SPs to advertise Graphsync retrievals for DDO deals. They don’t have to serve Graphsync retrieval requests; they only have to advertise the availability of such retrievals to IPNI.
    • Is it ugly? Yes.
    • Does it work with the existing SW like Boost? Yes!
  • How to ingest DDO deals and merge them with the current f05 fil-deal-ingester database:
    • Build spark-observer for DDO as planned for Spark v2. This will produce a list of (minerId, clientId, PieceCID, PieceSize) rows in an SQL table, with PayloadCID set to NULL for newly ingested deals.
    • Periodically scan the table produced by spark-observer and for each deal that does not have PayloadCID set, ask piece-indexer for a sample payload block CID and store the CID in our table if found. If not, then we need to retry the query, because it can take some time until IPNI advertisements are processed for the new deal. We can stop retrying after some configured time, e.g. 3 days.
    • Periodically cross-check the list of deals created by fil-deal-ingester (f05) with the list of deals created by spark-observer (DDO).
      • There should be a lot of overlap, because f05 deals will be ingested by both services.
      • Find deals that are in the spark-observer database only and which have PayloadCID value filled in. Add these deals to the “master” table of deals eligible for retrieval checking.
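The two periodic jobs above (backfill, then merge) could be sketched as pure functions over in-memory rows; a real implementation would express the same logic in SQL, and all names here are illustrative:

```javascript
// Sketch of the v1.5 ingestion steps; arrays stand in for the SQL tables.

// Backfill: for each observed DDO deal without a PayloadCID, ask the piece
// indexer (here an injected function) for a sample payload block CID.
function backfillPayloadCids (observedDeals, samplePayloadCid) {
  for (const deal of observedDeals) {
    if (deal.payloadCid) continue
    const cid = samplePayloadCid(deal.minerId, deal.pieceCid, deal.pieceSize)
    if (cid) deal.payloadCid = cid
    // else: leave NULL and retry on the next scan, giving up after a
    // configured time (e.g. 3 days)
  }
  return observedDeals
}

// Merge: add DDO-only deals that have a PayloadCID to the master table of
// deals eligible for retrieval checking; skip deals already ingested via f05.
function mergeEligibleDeals (f05Deals, observedDeals) {
  const seen = new Set(f05Deals.map(d => `${d.minerId}/${d.payloadCid}`))
  const merged = [...f05Deals]
  for (const d of observedDeals) {
    if (!d.payloadCid) continue // payload CID not discovered yet
    if (seen.has(`${d.minerId}/${d.payloadCid}`)) continue // overlap with f05
    merged.push({ minerId: d.minerId, payloadCid: d.payloadCid, clientId: d.clientId, source: 'ddo' })
  }
  return merged
}
```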

Upsides

  • All changes are isolated to the step “ingest new eligible deals”. No changes are needed in other components (spark-checker, spark-api, spark-evaluate, spark-stats, dashboards).
  • We don’t need to build infrastructure to run two modules (Spark v1 & Spark v2).

Downsides

  • A breaking change - the RSR reported by Spark will likely change because we will start sampling a different set of deals.
  • Less visibility into the process. We may need to build additional diagnostic tooling allowing SPs to understand which of their deals are considered by Spark as eligible for retrieval testing, and for deals that are not included, explain why.
  • Gameable: We will only test deals advertised to IPNI with Graphsync metadata. SP can selectively advertise only some deals to influence what Spark will test. (This is not a problem in Spark v2, where a deal not advertised to IPNI is flagged by the checkers.)
  • Throw-away work: the code backfilling PayloadCID for DDO deals and merging fil-deal-ingester deals with deal-observer deals won’t be used in Spark v2.
  • Ugliness: We will require SPs to advertise Graphsync retrievals, but we want them to serve Trustless HTTP GW retrievals instead.
  • Curio neither supports nor advertises Graphsync. Then again, it does not support DDO deals yet either.

Another iteration:

  • In the list of the eligible deals, keep track of which eligible deals were ingested from f05 and which were from DDO.
  • Produce two datasets - one for f05 deals only, another for the combined dataset.
  • This way we can keep most of the stack the same for Spark v1 and v2 (spark-api, spark-publish, spark-evaluate). What we need to change/duplicate is the code aggregating evaluated measurements into metrics like RSR. Plus update all dashboards, of course.

Clarification

Let me briefly explain:

  • When we observe f05 deals, we get a DealProposal which contains a Label field with the root block CID (by convention).
  • When we observe DDO sector events, there is no Label, only PieceCID and PieceSize.

The problem with that approach:

  • To discover payload CIDs from PieceCID and PieceSize, we need to impose additional requirements on SPs.
  • There is no miner SW meeting all of those requirements yet.
  • Even when there is some miner SW version that works with Spark v2 in the future, the adoption by SPs will take months.
  • As a result, if we switch from Spark v1 to v2, a lot of SPs with high Spark RSR will suddenly have zero RSR and won’t pass allocator compliance checks.

With Spark v1.5, I want to:

  • Keep observing f05 deals for backwards compatibility - to preserve the same RSR for miners that already have high Spark RSR.
  • Augment this list of deals with DDO-based deals to allow DDO-only miners to get Spark RSR score too.

To work around the lack of support in miner SW for Spark v2 requirements, I proposed to use IPNI metadata for Graphsync retrievals in Spark v1.5. It’s ugly, but it bridges the gap.

Next, what data to collect:

  1. I want to enhance the list of eligible deals with a flag indicating whether the deal was sourced from f05 DealProposals or DDO sector events.
  2. When linking measurements to tasks, we can copy this flag to each measurement; we already do the same to link measurements to storage deal clients.
  3. We are told that most SPs use only one deal type, either f05 or DDO. If that’s true, Spark v1.5 will not decrease the RSR of miners that are already compliant; rather, it will provide a score to providers with no Spark RSR score yet.

In the initial Spark v1.5 release, we can change the RSR calculation to combine the measurements from f05 and DDO deals. Only if we discover that some miners have a lower combined score will we need to implement additional measures.

For example:

  1. When aggregating data, we can duplicate the success fields in each table (and REST API response). Where we have successful and total now, we can add successfulDDO and totalDDO.
  2. This way, we start collecting data about the retrievability of DDO deals but don’t expose that data yet. Nothing changes for miners and FIL+ tooling yet.
  3. The next step is to review the data we are seeing.
    1. How many SPs have a good successful/total rate but a poor successfulDDO/totalDDO rate, and vice versa?
    2. Ideally, we will see that SPs use only one deal type (f05 or DDO) or have the same RSR for both deal types.
    3. If that’s the case, we can “upgrade” all existing metrics to combine results for both deal types.
    4. If not, then we have data to drive our decision-making about what to do next.
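A minimal sketch of the proposed aggregation, assuming the duplicated fields successful/total and successfulDDO/totalDDO described above (everything else is illustrative):

```javascript
// RSR for a single deal type; null when there were no measurements.
function rsr (successful, total) {
  return total === 0 ? null : successful / total
}

// Combined RSR across f05 and DDO measurements, as proposed for v1.5.
function combinedRsr ({ successful, total, successfulDDO, totalDDO }) {
  return rsr(successful + successfulDDO, total + totalDDO)
}
```

For a miner with only f05 deals (totalDDO = 0), the combined score equals the current score; a DDO-only miner gains a score instead of having none. A mixed miner whose per-type rates differ is exactly the case step 3 above is meant to surface.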

TODOs:

  • Define what exactly the breaking change is for v1, v1.5, and v2.
  • Are there any existing DDO FIL+ deals? Ask Will Scott about this. What is the impact on Spark RSR if we add DDO deals to the mix of deals tested?
    I think the big effect will be to provide a score to providers with no score rather than to change current provider scores.
    SPs tend to onboard with one mechanism.

    Q: How many DDO non-f05 FIL+ deals are there?

    A: In the last month, 10%; before that, 25%.